Principal Component Analysis

Principal component analysis (PCA) is arguably the first method for dimension reduction, dating back to papers by some of the earliest contributors to statistical theory, including Karl Pearson and Harold Hotelling [@pca_pearson; @pca_hotelling]. Pearson’s original development of PCA borrowed ideas from mechanics, which provides a clear geometric/physical interpretation of the resulting PCA loadings, variances, and scores, all of which we will define later. This interpretability, together with an implementation built on scalable linear algebra methods that allows PCA to be conducted on massive datasets, is one of the reasons PCA is still used prolifically to this day. In fact, many more modern and complex methods still rely on PCA as an internal step in their algorithmic structure.

There are a number of different but equivalent derivations of PCA, including the minimization of least squares error, covariance decompositions, and low-rank approximations. We’ll revisit these ideas later, but first, let’s discuss PCA through a geometrically motivated lens via a method called iterative projections.

Derivation 1: Iterative Projections

We begin with a data matrix \[{\bf X} = \begin{bmatrix} \vec{x}_1^T\\ \vdots \\\vec{x}_N^T\end{bmatrix} \in\mathbb{R}^{N\times d}.\] Consider an example of dimension reduction in which we seek to replace the vectors \(\vec{x}_1,\dots,\vec{x}_N\) with corresponding scalars \(y_1,\dots,y_N\) that preserve as much of the variability between these vectors as possible. To formalize this idea, let’s introduce a few assumptions.

First, we’ll assume the data \(\vec{x}_1,\dots,\vec{x}_N\) are centered. This is not a requirement, but it will simplify the analysis. We’ll discuss how to account for this centering step later, but for now assume \(\bar{x} = \vec{0}\) so that \({\bf HX} = {\bf X}\). More importantly, let’s assume that each \(y_i\) is derived in the same way. Specifically, let \(y_i = \vec{x}_i^T \vec{w}\) for some common vector \(\vec{w}\). Thus, we can view each one-dimensional representation as a dot product of the corresponding observed vector with the same vector \(\vec{w}.\) We can compactly write this expression as \[\vec{y} = \begin{bmatrix}y_1\\ \vdots \\ y_N \end{bmatrix}=\begin{bmatrix}\vec{x}_1^T \vec{w} \\ \vdots \\ \vec{x}_N^T \vec{w}\end{bmatrix} = {\bf X} \vec{w}.\]

How do we choose \(\vec{w}\)? We would like differences in the scalars \(y_1,\dots,y_N\) to reflect differences in the vectors \(\vec{x}_1,\dots,\vec{x}_N\), so having \(y_1,\dots,y_N\) spread out is a natural goal: if \(\vec{x}_i\) and \(\vec{x}_j\) are far apart, then \(y_i\) and \(y_j\) should be as well. To achieve this, we’ll try to maximize the sample variance of the \(y\)’s. The sample variance \[\frac{1}{N} \sum_{i=1}^N (y_i - \bar{y})^2 = \frac{1}{N}\sum_{i=1}^N(\vec{x}_i^T \vec{w} - \bar{y})^2\] will depend on our choice of \(\vec{w}\). In the previous expression, \[\bar{y} = \frac{1}{N}\sum_{i=1}^N y_i = \frac{1}{N}\sum_{i=1}^N \vec{x}_i^T \vec{w} = \frac{1}{N}\vec{1}^T{\bf X}\vec{w}\] is the sample mean of \(y_1,\dots,y_N.\) Importantly, since we have assumed that \(\vec{x}_1,\dots,\vec{x}_N\) are centered, it follows that \(\bar{y}=0\) and the sample variance of \(y_1,\dots,y_N\) simplifies to \[\frac{1}{N}\sum_{i=1}^N(\vec{x}_i^T \vec{w})^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 = \frac{1}{N} \|\vec{y}\|^2 = \frac{1}{N}\vec{y}^T\vec{y}.\]

We can write the above expression more compactly. Using the identity \(\vec{y} = {\bf X}\vec{w}\), we want to choose \(\vec{w}\) to maximize \[\frac{1}{N}\vec{y}^T\vec{y} = \frac{1}{N}({\bf X}\vec{w})^T{\bf X}\vec{w} = \frac{1}{N}\vec{w}^T{\bf X}^T{\bf X}\vec{w} = \vec{w}^T\left(\frac{{\bf X}^T{\bf X}}{N}\right)\vec{w}.\] Since we have assumed that \({\bf X}\) is centered it follows that \({\bf X}^T{\bf X}/N\) is the sample covariance matrix \(\hat{\bf \Sigma}\)! Thus, we want to make \(\vec{w}^T\hat{\bf \Sigma} \vec{w}\) as large as possible.
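As a quick numerical sanity check of this identity, the sketch below uses simulated data; the matrix `X` and direction `w` are illustrative, not the chapter's example.

```r
set.seed(1)
N <- 200
X <- matrix(rnorm(N * 3), N, 3)
X <- scale(X, center = TRUE, scale = FALSE)          # center the columns so the sample mean is 0
w <- c(3, 4, 0) / 5                                  # an arbitrary unit-length direction
Sigmahat <- t(X) %*% X / N                           # sample covariance of the centered data
y <- X %*% w                                         # one-dimensional representations y = Xw
all.equal(drop(t(w) %*% Sigmahat %*% w), mean(y^2))  # quadratic form equals the score variance
```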

Naturally, we could increase the entries in \(\vec{w}\) and increase the above expression without bound. To make the maximization problem well posed, we will restrict \(\vec{w}\) to be unit-length under the Euclidean norm so that \(\|\vec{w}\|=1.\) We now have a constrained optimization problem which gives rise to the first principal component loading.

The first principal component loading is the vector \(\vec{w}_1\) solving the constrained optimization problem \[\begin{equation} \begin{split} \text{Maximize } &\vec{w}^T \hat{\bf \Sigma}\vec{w} \\ \text{subject to constraint } &\|\vec{w}\|=1. \end{split} \end{equation}\] The first principal component scores are the projections, \(y_i = \vec{x}_i^T\vec{w}_1\) for \(i=1,\dots, N\), of each sample onto the first loading.

To find the first PCA loading we can make use of Lagrange multipliers (see exercises) to show that \(\vec{w}_1\) must also satisfy the equation \[\hat{\bf \Sigma}\vec{w}_1 = \lambda \vec{w}_1\] where \(\lambda\) is the Lagrange multiplier. From this expression, we can conclude that the first principal component loading is the unit length eigenvector associated with the largest eigenvalue of the sample covariance matrix \(\hat{\bf \Sigma}\) and that the Lagrange multiplier \(\lambda\) is the largest eigenvalue of \(\hat{\bf \Sigma}\). In this case, we refer to \(\lambda\) as the first principal component variance.
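The Lagrangian calculation itself (deferred to the exercises) is short. With multiplier \(\lambda\) for the unit-norm constraint,

```latex
\mathcal{L}(\vec{w},\lambda)
  = \vec{w}^T \hat{\bf \Sigma}\vec{w} - \lambda\left(\vec{w}^T\vec{w} - 1\right),
\qquad
\nabla_{\vec{w}}\mathcal{L}
  = 2\hat{\bf \Sigma}\vec{w} - 2\lambda \vec{w} = \vec{0}
\quad\Longrightarrow\quad
\hat{\bf \Sigma}\vec{w} = \lambda \vec{w}.
```

Left-multiplying the final equation by \(\vec{w}^T\) and using \(\vec{w}^T\vec{w}=1\) gives \(\vec{w}^T\hat{\bf \Sigma}\vec{w} = \lambda\), so the objective is maximized by taking the eigenvector with the largest eigenvalue.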

Geometric Interpretation of \(\vec{w}_1\)

Since \(\|\vec{w}_1\| = 1\) we may interpret this vector as specifying a direction in \(\mathbb{R}^d\). Additionally, we can decompose each of our samples into two pieces: one pointing in the direction specified by \(\vec{w}_1\) and a second portion perpendicular to this direction. Thus, we may write \[\vec{x}_i = \underbrace{\vec{w}_1 \vec{x}_i^T\vec{w}_1}_{\text{parallel}} + \underbrace{(\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1)}_{\text{perpendicular}}.\] By the Pythagorean theorem, \[\begin{align*} \|\vec{x}_i\|^2 &= \| \vec{w}_1 \vec{x}_i^T\vec{w}_1 \|^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \\ &= (\vec{w}_1^T\vec{x}_i)^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \\ &= y_i^2 + \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2 \end{align*}\] for \(i=1,\dots,N\). Averaging over all of the samples gives the expression \[\frac{1}{N}\sum_{i=1}^N\|\vec{x}_i\|^2 = \frac{1}{N}\sum_{i=1}^N y_i^2 +\frac{1}{N}\sum_{i=1}^N \|\vec{x}_i -\vec{w}_1 \vec{x}_i^T\vec{w}_1\|^2.\] The left-hand side of the above expression is fixed for a given set of data, whereas the first term on the right side is exactly what we sought to maximize when finding the first principal component loading. This quantity is the average squared length of the projection of each sample onto the direction \(\vec{w}_1\). As such, we can view the first principal component loading as the direction in which \(\vec{x}_1,\dots,\vec{x}_N\) vary the most. Let’s turn to an example in \(\mathbb{R}^3\) to visualize this.
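This decomposition is easy to verify numerically; the vector and loading below are illustrative, not the chapter's data.

```r
x  <- c(3, 4, 0)                          # an illustrative sample vector
w1 <- c(1, 0, 0)                          # an illustrative unit-length loading
y1 <- drop(t(x) %*% w1)                   # score: signed length of the projection onto w1
parallel <- w1 * y1                       # component along w1
perp     <- x - parallel                  # component perpendicular to w1
sum(x^2) == y1^2 + sum(perp^2)            # Pythagorean identity; TRUE
```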

Below, we show a scatterplot of \(N=500\) points in \(\mathbb{R}^3\) drawn randomly from a multivariate normal (MVN) distribution. These data have been centered.

Notice the oblong shape of the cloud of points. Rotating this image, it is clear that the data varies more in certain directions than in others.

The sample covariance matrix of these data is \[ \hat{\Sigma} = \begin{bmatrix} 11.83&6.78&3.38 \\ 6.78&8.88&9.2 \\ 3.38&9.2&15.9 \\ \end{bmatrix} \] We can use the eigendecomposition of this matrix to find the first PCA loading and variance. Given the first loading we can also compute the first PCA scores. A short snippet of code for this calculation is shown below.

N <- nrow(data)                            # number of samples; `data` is the data matrix
Sigmahat <- (t(data) %*% data)/N           # sample covariance matrix; recall the data have been centered
Sigmahat.eigen <- eigen(Sigmahat)          # eigendecomposition of Sigmahat
y <- data %*% Sigmahat.eigen$vectors[,1]   # first PC scores

The largest eigenvalue (first PC variance) is \(\lambda = 25.58\) with associated eigenvector \((0.45, 0.56, 0.69)^T\) which is the first PC loading.

We can visualize these results in a few different ways. First, we can add the span of \(\vec{w}_1\) (shown in red) to the scatterplot of the data. One can see that \(\vec{w}_1\) is oriented along the direction where the data is most spread out.

Samples with Span of First Loading

We could also generate a scatterplot of the scores, but we’ll show these scores along the \(\vec{w}_1\) axis so that they correspond to the projection of each sample onto span\(\{\vec{w}_1\}\); equivalently, we are plotting the vectors \(y_i \vec{w}_1\).

Decomposition of Samples into Components Parallel and Perpendicular to First Loading

Additional Principal Components

The first PCA loading provides information about the direction in which our data most greatly vary, but it is quite possible that there are other directions in which our data still exhibit substantial variability. In fact, the notion of a first principal component loading, scores, and variance suggests the existence of a second, third, etc. iteration of these quantities. To explore these quantities, let’s proceed as follows.

For each datum, we can remove its component in the direction of \(\vec{w}_1\) and focus on the projection onto the orthogonal complement of \(\vec{w}_1\). Let \[\vec{x}_i^{(1)} = \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 = \vec{x}_i - \vec{w}_1 y_i\] denote the portion of \(\vec{x}_i\) which is orthogonal to \(\vec{w}_1\). These points are shown on the right side of Figure @ref(fig:pca1). Here, the superscript \(^{(1)}\) indicates we have removed the portion of each vector in the direction of the first loading.

We can organize the orthogonal components into a new data matrix \[{\bf X}^{(1)} = \begin{bmatrix} \left(\vec{x}_1^{(1)}\right)^T \\ \vdots \\ \left(\vec{x}_N^{(1)}\right)^T \end{bmatrix} = \begin{bmatrix} \vec{x}_1^T - \vec{x}_1^T\vec{w}_1\vec{w}_1^T \\ \vdots \\ \vec{x}_N^T - \vec{x}_N^T\vec{w}_1\vec{w}_1^T \end{bmatrix} = {\bf X} - {\bf X}\vec{w}_1\vec{w}_1^T.\] Now let’s apply PCA to the updated data matrix \({\bf X}^{(1)}\), from which we get the second principal component loading, denoted \(\vec{w}_2\), the second principal component scores, and the second principal component variance. One can show that the data matrix \({\bf X}^{(1)}\) is centered so that its sample covariance matrix is \[\begin{equation} \hat{\bf \Sigma}^{(1)} = \frac{1}{N}({\bf X}^{(1)})^T{\bf X}^{(1)}. \end{equation}\] The 2nd PC variance is the largest eigenvalue of \(\hat{\bf \Sigma}^{(1)}\) and its associated unit length eigenvector is the 2nd PC loading, denoted \(\vec{w}_2\). The second PC scores are \(\vec{w}_2^T\vec{x}_i^{(1)}\) for \(i=1,\dots,N.\)
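The deflation step is easy to sketch in code. The simulated data matrix below is a stand-in for the chapter's example (the mixing matrix is arbitrary, chosen only to make the point cloud oblong):

```r
set.seed(1)
N <- 500
A <- matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 1), 3, 3)       # arbitrary mixing matrix
data <- scale(matrix(rnorm(N * 3), N, 3) %*% A, center = TRUE, scale = FALSE)
Sigmahat <- t(data) %*% data / N                      # sample covariance (data are centered)
w1 <- eigen(Sigmahat, symmetric = TRUE)$vectors[, 1]  # first loading
data1 <- data - data %*% w1 %*% t(w1)                 # deflate: X^(1) = X - X w1 w1^T
Sigmahat1 <- t(data1) %*% data1 / N                   # covariance of the deflated data
w2 <- eigen(Sigmahat1, symmetric = TRUE)$vectors[, 1] # second loading
y2 <- data1 %*% w2                                    # second PC scores
abs(drop(t(w1) %*% w2))                               # ~0: the two loadings are orthogonal
```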

Continuing Example @ref(ex:pca-demo-1), we have \[ \hat{\bf \Sigma}^{(1)} = \begin{bmatrix} 6.68&0.3&-4.57 \\ 0.3&0.74&-0.8 \\ -4.57&-0.8&3.61 \\ \end{bmatrix} \] which has largest eigenvalue 10.02 (2nd PC variance) and associated eigenvector \(\vec{w}_2 = (0.81, 0.08, -0.59)^T\) (2nd PC loading). The second set of PC scores are given by \(\vec{w}_2^T\vec{x}_i^{(1)}\) for \(i=1,\dots, N.\)

Here is one crucial observation. The vector \(\vec{w}_2\) gives the direction of greatest variability of the vectors \(\vec{x}_1^{(1)},\dots,\vec{x}_N^{(1)}.\) For each of these vectors we have removed the component in the direction of \(\vec{w}_1\). Thus, \(\vec{x}_1^{(1)},\dots,\vec{x}_N^{(1)}\) do not vary at all in the \(\vec{w}_1\) direction. What can we say about \(\vec{w}_2\)? Naturally, it must be perpendicular to \(\vec{w}_1\)! We can see this geometric relationship if we plot the vectors \(\vec{x}^{(1)}_i\) in our previous example along with the span of the first and second loading.

First and Second Loadings with Data after \(\vec{w}_1\) Component Removed

We need not stop at the second PCA loading, scores, and variance. We could remove components in the direction of \(\vec{w}_2\) and apply PCA to the vectors \[\begin{align*} \vec{x}_i^{(2)} &= \vec{x}_i^{(1)} - \vec{w}_2 (\vec{x}_i^{(1)})^T\vec{w}_2\\ &= \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 - \vec{w}_2(\vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1)^T\vec{w}_2\\ &= \vec{x}_i - \vec{w}_1\vec{x}_i^T\vec{w}_1 - \vec{w}_2\vec{x}_i^T\vec{w}_2 + \vec{w}_2\vec{w}_1^T\vec{x}_i\underbrace{\vec{w}_1^T\vec{w}_2}_{=0} \end{align*}\] which gives rise to the centered data matrix \[\begin{equation} {\bf X}^{(2)} = \begin{bmatrix} \vec{x}_1^T- \vec{w}_1^T\vec{x}_1\vec{w}_1^T - \vec{w}_2^T\vec{x}_1\vec{w}_2^T \\ \vdots \\ \vec{x}_N^T- \vec{w}_1^T\vec{x}_N\vec{w}_1^T - \vec{w}_2^T\vec{x}_N\vec{w}_2^T \end{bmatrix} = {\bf X} - {\bf X}\vec{w}_1\vec{w}_1^T - {\bf X}\vec{w}_2\vec{w}_2^T \end{equation}\] with corresponding covariance matrix \[\hat{\Sigma}^{(2)} = \frac{1}{N}({\bf X}^{(2)})^T{\bf X}^{(2)}\] from which we can obtain a third loading (\(\vec{w}_3\)), variance (\(\lambda_3\)), and set of scores \(\vec{w}_3^T\vec{x}_i^{(2)}\) for \(i=1,\dots,N\).

We can continue repeating this argument \(d\) times for our \(d\)-dimensional data until we arrive at a set of \(d\) unit vectors \(\vec{w}_1,\dots,\vec{w}_d\) which are the \(d\) PCA loadings. Why do we stop after \(d\)? The principal component loadings \(\vec{w}_1,\dots,\vec{w}_d\) are all mutually orthogonal and unit length so they form an orthonormal basis for \(\mathbb{R}^d\). All of the variability in each sample can be expressed in terms of these \(d\) basis vectors.

This iterative approach is admittedly too intensive for most practical applications. Fortunately, we do not need to follow this sequence of projections and eigenvalue computations, thanks to the following theorem.

Suppose \(\vec{w}_1,\dots,\vec{w}_d\) are the orthonormal eigenvectors of the sample covariance \(\hat{\Sigma}=\frac{1}{N}{\bf X}^T{\bf X}\) with eigenvalues \(\lambda_1\ge \dots\ge \lambda_d \ge 0\) respectively, i.e. \(\hat{\Sigma}\vec{w}_j = \lambda_j\vec{w}_j\). Then \(\vec{w}_1,\dots,\vec{w}_d\) are also eigenvectors of the matrix \(\hat{\Sigma}^{(1)}=\frac{1}{N}({\bf X}^{(1)})^T{\bf X}^{(1)}\) with eigenvalues \(0,\lambda_2,\dots,\lambda_d\) respectively.
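A sketch of why this holds: writing \({\bf X}^{(1)} = {\bf X}({\bf I} - \vec{w}_1\vec{w}_1^T)\) and using the symmetry and idempotence of the projector \({\bf I} - \vec{w}_1\vec{w}_1^T\),

```latex
\hat{\Sigma}^{(1)}
  = ({\bf I} - \vec{w}_1\vec{w}_1^T)\,\hat{\Sigma}\,({\bf I} - \vec{w}_1\vec{w}_1^T)
  = \hat{\Sigma} - \lambda_1 \vec{w}_1\vec{w}_1^T,
```

so \(\hat{\Sigma}^{(1)}\vec{w}_1 = \lambda_1\vec{w}_1 - \lambda_1\vec{w}_1 = \vec{0}\), while for \(j \ge 2\) the orthogonality \(\vec{w}_1^T\vec{w}_j = 0\) gives \(\hat{\Sigma}^{(1)}\vec{w}_j = \lambda_j\vec{w}_j\).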

This result follows nicely from the geometric observation that the loadings are mutually orthogonal. The second loading is then the eigenvector associated with the second largest eigenvalue of the original covariance matrix. Taking the orthogonality of the loadings further, we can get even more from the following corollary.

Suppose \(\vec{w}_1,\dots,\vec{w}_d\) are the orthonormal eigenvectors of the sample covariance \(\hat{\Sigma}=\frac{1}{N}{\bf X}^T{\bf X}\) with eigenvalues \(\lambda_1\ge \dots\ge \lambda_d \ge 0\) respectively. Then for \(k=2,\dots,d-1\), the vectors \(\vec{w}_1,\dots,\vec{w}_d\) are also eigenvectors of the matrix \(\hat{\Sigma}^{(k)}=\frac{1}{N}({\bf X}^{(k)})^T{\bf X}^{(k)}\) with eigenvalues \(\underbrace{0,\dots,0}_k,\lambda_{k+1},\dots,\lambda_d\) respectively.

As a result, we can immediately compute the PCA variances and loadings given the full spectral (eigenvalue) decomposition \[\begin{equation} \hat{\Sigma} = \begin{bmatrix} \vec{w}_1 & | & \dots & | &\vec{w}_d \end{bmatrix} \begin{bmatrix} \lambda_1 &0 & 0 \\ 0 &\ddots & 0 \\ 0 & 0 &\lambda_d \end{bmatrix} \begin{bmatrix} \vec{w}_1^T \\ \vdots \\\vec{w}_d^T \end{bmatrix}. \end{equation}\]

Again, thanks to orthogonality, we can also compute the PCA scores directly from the original data without iteratively removing components of each vector in the direction of the various loadings. To summarize PCA: given data \({\bf X}\), we first compute the spectral decomposition of the sample covariance matrix \(\hat{\bf \Sigma}\). The eigenvalues (in decreasing order) are the PC variances, and the corresponding unit-length eigenvectors are the loadings, which form an orthonormal basis for \(\mathbb{R}^d\). The PC scores are the inner products of each vector with these loadings (assuming the data are centered). Thus, the PCA scores are \(\vec{x}_i^T\vec{w}_1,\dots,\vec{x}_i^T\vec{w}_d\) for \(i=1,\dots,N.\) We can compactly (again assuming our data are centered) express the scores in terms of the original data matrix and the loadings as

\[\begin{equation} {\bf Y} = {\bf XW} \end{equation}\]

where \({\bf Y}\in \mathbb{R}^{N\times d}\) is a matrix of PC scores and \({\bf W}\) is the orthonormal matrix with columns given by the loadings.
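Under these assumptions (centered data), the one-shot computation can be checked against R's built-in prcomp, which obtains the same quantities via an SVD; note that prcomp scales variances by \(1/(N-1)\) rather than \(1/N\), which rescales the eigenvalues but not the loadings or scores, and individual columns may differ by a sign flip. The data below are simulated for illustration.

```r
set.seed(2)
N <- 100
X <- scale(matrix(rnorm(N * 3), N, 3), center = TRUE, scale = FALSE)
W <- eigen(t(X) %*% X / N, symmetric = TRUE)$vectors  # loadings, by decreasing eigenvalue
Y <- X %*% W                                          # all PC scores at once: Y = XW
pc <- prcomp(X, center = FALSE)                       # built-in PCA (SVD-based)
max(abs(abs(Y) - abs(pc$x)))                          # ~0: same scores up to column signs
```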

Geometric interpretation of PCA

We began with an \(N\times d\) data matrix \({\bf X}\) and now we have an \(N\times d\) matrix of PCA scores \({\bf Y}\). One may be inclined to ask: where is the dimension reduction? To answer this question, let’s first consider the sample covariance matrix of the PCA scores, which one can show is diagonal: \[\begin{equation} \hat{\bf \Sigma}_Y = \begin{bmatrix} \lambda_1 & & \\ &\ddots & \\ && \lambda_d \end{bmatrix}. \end{equation}\]
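One can verify numerically that the score covariance is diagonal, with the PC variances on the diagonal (simulated data again, purely for illustration):

```r
set.seed(3)
N <- 1000
X <- scale(matrix(rnorm(N * 3), N, 3), center = TRUE, scale = FALSE)
e <- eigen(t(X) %*% X / N, symmetric = TRUE)  # spectral decomposition of Sigmahat
Y <- X %*% e$vectors                          # PC scores Y = XW
SigmaY <- t(Y) %*% Y / N                      # sample covariance of the scores
max(abs(SigmaY - diag(e$values)))             # ~0: diagonal matrix of PC variances
```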

Derivation 2: Optimal Linear Subspace